ABSTRACT The relationship between the selection affecting codon usage and selection on protein sequences of orthologous genes 
in diverse groups of bacteria and archaea was examined by using the Alignable Tight Genome Clusters database of prokaryote 
genomes. The codon usage bias is generally low, with 57.5% of the gene-specific optimal codon frequencies (F_opt) being below 
0.55. This apparent weak selection on codon usage contrasts with the strong purifying selection on amino acid sequences, with 
65.8% of the gene-specific dN/dS ratios being below 0.1. For most of the genomes compared, a limited but statistically significant 
negative correlation between F_opt and dN/dS was observed, which is indicative of a link between selection on protein sequence 
and selection on codon usage. The strength of the coupling between the protein level selection and codon usage bias showed a 
strong positive correlation with the genomic GC content. Combined with previous observations on the selection for GC-rich 
codons in bacteria and archaea with GC-rich genomes, these findings suggest that selection for translational fine-tuning could be 
an important factor in microbial evolution that drives the evolution of genome GC content away from mutational equilibrium. 
This type of selection is particularly pronounced in slowly evolving, "high-status" genes. A significantly stronger link between 
the two aspects of selection is observed in free-living bacteria than in parasitic bacteria and in genes encoding metabolic enzymes 
and transporters than in informational genes. These differences might reflect the special importance of translational fine-tuning 
for the adaptability of gene expression to environmental changes. The results of this work establish the coupling between protein 
level selection and selection for translational optimization as a distinct and potentially important factor in microbial evolution. 

IMPORTANCE Selection affects the evolution of microbial genomes at many levels, including both the structure of proteins and 
the regulation of their production. Here we demonstrate the coupling between the selection on protein sequences and the opti- 
mization of codon usage in a broad range of bacteria and archaea. The strength of this coupling varies over a wide range and 
strongly and positively correlates with the genomic GC content. The cause(s) of the evolution of high GC content is a long- 
standing open question, given the universal mutational bias toward AT. We propose that optimization of codon usage could be 
one of the key factors that determine the evolution of GC-rich genomes. This work establishes the coupling between selection at 
the level of protein sequence and at the level of codon choice optimization as a distinct aspect of genome evolution. 
The amino acid sequences of the great majority of proteins 
evolve under the pressure of purifying selection that can be 
measured through the ratio of the rates of nonsynonymous and 
synonymous substitutions (dN/dS) in protein-coding sequences 
(1-4). The strength of purifying selection shows broad variation 
between sites within a protein-coding gene, between genes within 
an evolving genome, and between evolving genomes in different 
organismal lineages (5-8). Generally, purifying selection is strong 
in organisms with large effective population sizes, such as bacteria, 
but substantially weaker in organisms with small effective popu- 
lation sizes, such as multicellular eukaryotes (9, 10). Within a 
bacterial or archaeal genome, which typically encompasses be- 
tween 1,000 and 10,000 protein-coding genes, the dN/dS ratio 
varies within approximately 2 orders of magnitude, from ~0.01 to 
~1.00, with the mean and median of the distribution being close to 
0.1 (11-13). Furthermore, comparative analysis of the dN/dS ratios 
across a broad range of bacterial and archaeal genomes that 
were collected in the database of Alignable Tight Genome Clusters 
(ATGC) (14) has shown that the median dN/dS ratio is stable 
within each ATGC but differs between ATGC, with the implication 
that this ratio is a robust, lineage-specific gauge of purifying 
selection (12). 

The use of dN/dS as a measure of selection on protein se- 
quences is based on the assumption that synonymous substitu- 
tions are neutral. This assumption can be a reasonable approxi- 
mation inasmuch as selection affecting nonsynonymous sites is 
substantially stronger than that affecting synonymous sites. How- 
ever, it is well established that synonymous sites in protein-coding 
sequences actually are subject to selection driven by at least two 
factors, RNA secondary structure and codon usage (15-18). The 
study of codon usage bias (CUB) is a long-standing direction in 
molecular evolution. Two fundamentally different but not mutually 
exclusive types of explanations for the existence of CUB have 
been explored, namely, mutational (neutral) and selectional ori- 
gins. The important contribution of neutral mutational processes 
is suggested by the observations that GC content is the variable 
that best explains the interspecies differences in codon usage (19, 
20). Moreover, it has been shown that CUB in bacteria could be 
predicted from the nucleotide composition of intergenic regions 
(20). However, there are also multiple strong indications of the 
important role of selection in the evolution of codon usage. The 
key early observations, made primarily on classical model organ- 
isms, the bacterium Escherichia coli, and the yeast Saccharomyces 
cerevisiae, are compatible with the selectionist but not the neutral 
hypothesis: CUB is particularly strong in highly expressed genes, 
and the usage of a particular codon strongly correlates with the 
abundance of the cognate tRNAs (21-25). 

Subsequent research in this area to a large extent concentrated 
on the nature and strength of the selection that affects CUB (26, 
27). It has been reported that in enterobacteria, CUB is strongly 
and negatively correlated with the synonymous evolution rate, 
i.e., genes with strong CUB typically evolve slowly; in contrast, 
little correlation was detected between CUB and the rate of pro- 
tein evolution (28, 29). However, a more recent analysis of a range 
of model organisms, including E. coli and several eukaryotes, has 
revealed roughly the same strength of inverse correlation between 
CUB and dN compared to dS (30). Hartl and colleagues applied 
population genetic theory to estimate the selection coefficient on 
synonymous codon positions in enterobacteria and arrived at val- 
ues on the order of 10~^, indicative of weak selection that, how- 
ever, could be consequential in large bacterial populations (31). 

Two major factors underlying selection for CUB have been 
considered, namely, accuracy and rate of translation. The impor- 
tance of translation accuracy was first suggested by experimental 
data indicating that codon choice strongly affected the rate of 
amino acid misincorporation during translation (32-34). Subse- 
quently, it has been demonstrated that evolutionarily conserved 
amino acid sites showed a significantly stronger CUB than variable 
sites, as one would expect if selection acted to minimize the effect 
of amino acid misincorporation (35, 36). However, there are also 
substantial indications that selection for an increased rate or, 
more precisely, efficiency of translation contributes to the evolu- 
tion of CUB. Indeed, optimal codons appear to be translated faster 
than suboptimal ones (37). Although this difference might not 
substantially affect the actual rate of protein production, which 
appears to be determined primarily by the rate of translation ini- 
tiation (38), acceleration of elongation increases the supply of free 
ribosomes, a growth rate-limiting parameter in bacteria (26, 39). 
Indeed, a strong inverse correlation between codon bias and bac- 
terial generation time has been detected, suggesting that the use of 
optimal codons is essential for fast growth (40-42). tRNA modi- 
fications also enhance translation speed and/or accuracy in differ- 
ent codon groups (43). A recent analysis of the codon usage of 
yeast took advantage of ribosome profiling data to show that op- 
timal codons were actually not translated faster than suboptimal 
codons in vivo (44). Instead, it has been shown that, under condi- 
tions of tRNA shortage, the primary determinant of translation 
efficiency was the usage of codons proportional to the abundance 
of the cognate tRNAs (44). Analysis of codon usage in diverse 
bacteria by a recently developed statistical method yielded indica- 
tions that selection for translation efficiency made a substantially 



greater contribution to the evolution of CUB than selection for 
translation accuracy (45). 

The regulatory effects of CUB on cellular processes are likely to 
be multifaceted and remain only partially explored. For example, 
a recent study of the expression of bacterial operons that encode 
protein complexes with uneven subunit stoichiometry has shown 
that CUB is a key factor that provides for higher expression of the 
more abundant subunits (46). 

Overall, the current view of CUB evolution centers around the 
selection-mutation-drift model, according to which there is (rel- 
atively) weak selection for preferred (major or optimal) codons 
but nonpreferred codons persist owing to mutational bias and 
genetic drift (26, 47-49). The strength of selection on CUB ap- 
pears to vary broadly both across genes and across species, and 
translation accuracy and translation efficiency are both subject to 
selection, although the relative contributions of these two factors 
remain a matter of debate. 

We were interested in exploring the connection between selec- 
tion on CUB and selection on protein sequences. Generally, one 
would expect that the selective pressures at the two levels are cou- 
pled, given that high-expression genes, on the one hand, show a 
greater CUB than low-expression genes (21-25) and, on the other 
hand, on average evolve slowly (50-53). 

However, previous studies have not resulted in certainty with 
regard to the existence and strength of this coupling, largely because 
CUB (measured as the fraction of optimal codons, F_opt) has 
been shown to depend similarly on dS and dN, with both depen- 
dencies thought to be uniformly gauged by the effective popula- 
tion size of an organism (30). 

We performed a broad survey of the correlations between 
dN/dS and F_opt in bacteria and archaea, taking advantage of the 
database of ATGC, which encompasses groups of closely related 
genomes across the diversity of bacteria and archaea (14). We find 
that there is a nearly universal inverse correlation between these 
two variables; i.e., the two levels of selection are coupled. The 
strength of this coupling depends on the genomic GC content, 
suggesting that fine-tuning of translation efficiency and fidelity, 
especially in highly expressed genes, is an important factor in the 
evolution of the GC content of microbial genomes away from 
mutational equilibrium. 

RESULTS 

Universal coupling between selection on codon usage and selection 
on amino acid sequences and its dependence on genomic 
GC content. We first calculated the F_opt value and the dN/dS ratio 
of each pair of orthologous genes in a randomly selected pair of 
genomes from each ATGC (or the only pair for the ATGC consisting 
of two genomes; see Materials and Methods for details). 
The CUB was found to be relatively low, with F_opt being below 0.55 
for 57.5% of the genes (Fig. 1). This relatively weak selection on 
codon usage contrasts with the typically strong purifying selection 
on amino acid sequences, with 65.8% of the dN/dS ratios being 
below 0.10 (Fig. 1). 
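
As a rough illustration of the F_opt measure, the fraction of optimal codons in a gene could be computed along the following lines. This is a minimal sketch and not the pipeline used in this study: the function name `f_opt`, the toy set of optimal codons, and the choice to ignore Met, Trp, and stop codons are assumptions made for the example. In practice the optimal codon of each amino acid has to be determined for each genome separately.

```python
def f_opt(cds, optimal_codons, ignored=("ATG", "TGG", "TAA", "TAG", "TGA")):
    """Fraction of optimal codons among the codons that have synonymous alternatives.

    cds            -- coding sequence (DNA), read in frame from its first base
    optimal_codons -- set of codons regarded as optimal (assumed to be known)
    ignored        -- codons without a synonymous choice, plus stop codons
    """
    cds = cds.upper()
    # Split into complete codons; a trailing partial codon is dropped.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    informative = [c for c in codons if c not in ignored]
    if not informative:
        return float("nan")
    n_optimal = sum(1 for c in informative if c in optimal_codons)
    return n_optimal / len(informative)


# Toy example: hypothetical optimal codons for Leu, Lys, and Gly only.
toy_optimal = {"CTG", "AAA", "GGC"}
toy_gene = "ATGCTGAAAGGCCTTAAGGGTTAA"
print(f_opt(toy_gene, toy_optimal))  # 3 optimal codons out of 6 informative ones
```

Under these assumptions, a gene with F_opt below 0.55 uses the optimal codon in fewer than 55% of the positions where a synonymous choice exists.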

For the substantial majority of the 120 ATGC analyzed, a statistically 
significant negative correlation between the gene-specific 
F_opt value and the dN/dS ratio was detected (Fig. 1). As shown 
previously, the genome-wide median dN/dS ratio is a stable characteristic 
of an ATGC (12). Therefore, we used the median dN/dS 
ratios and F_opt values of all of the genes in each ATGC (that is, of a 
random pair of genomes in the case of a large ATGC) as ATGC 

FIG 1 Quantitative characteristics of genome evolution of the 120 ATGC analyzed. R is the Spearman rank coefficient of correlation between F_opt and dN/dS 
in an ATGC. Column P includes the P values of R for each ATGC. F_opt is the median F_opt in an ATGC. dN/dS is the median dN/dS ratio in an ATGC. GS is the 
genome size of sample species. GC% is the GC content of sample species. deltaGC stands for ΔGC (see Results). The color code is explained at the bottom. 

properties; for the sake of simplicity, here we use dN/dS and F_opt to 
denote these median values. The dN/dS and F_opt values of the 120 
ATGC showed a limited but statistically significant negative correlation 
(Fig. 2; Spearman's ρ = −0.251, P = 0.0058 [Spearman 
test]; here we denote this correlation coefficient R). Thus, the generally 
expected coupling between selection at the level of protein 
sequences and selection at the level of codon usage indeed seems 
to exist across a broad range of bacterial and archaeal genomes. 
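
The rank correlation reported here can be illustrated with a short, self-contained sketch. It computes Spearman's ρ as the Pearson correlation of the ranks, with tied values given the mean of their ranks. The function names and the input values are made up for the example and are not data from this study; the P value is not computed.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values: F_opt falls as dN/dS rises, so the ranks are exactly reversed.
f_opt_values = [0.62, 0.55, 0.48, 0.51, 0.45]
dn_ds_values = [0.03, 0.06, 0.12, 0.09, 0.20]
print(spearman_rho(f_opt_values, dn_ds_values))
```

A value of R close to −1 corresponds to strong coupling in the sense used in this work, and a value near 0 corresponds to weak or absent coupling.
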
Having established the existence of the coupling between the 
two levels of selection, we sought to identify its possible underlying 
causes. Given that codon usage depends strongly on genomic 
GC content, which itself is strongly positively correlated with genome 
size (GS) (54) (Fig. 3, Spearman's ρ = 0.661, P < 2.2e-16), 
we turned to principal-component analysis (PCA) with five variables, 
dN/dS, F_opt, R, GS, and GC content (GC%). The first principal 
component explained more than half of the variation in the 
data, with the main contributions, with opposite signs, coming 
from GC content and R (Fig. 4). However, GS also makes a substantial 
contribution to principal component 1, conceivably because 
of the strong correlation between GC content and GS 
(Fig. 3). Principal component 2, which explains 22.5% of the data 
variance, reflects primarily the opposite contributions of F_opt and 
dN/dS, in agreement with the observed negative correlation (Fig. 2 
and 3). 

FIG 2 Correlation between the genomic median dN/dS and F_opt values of the 
120 ATGC. 

FIG 4 PCA of the 120 ATGC in the space of five variables, dN/dS, F_opt, R, GC 
content, and GS. 

Pairwise correlation analysis showed that by far the strongest 
correlation exists between GC content and R (Fig. 5a), followed by 
the correlation between GS and R (Fig. 5b). Notably, the median 
dN/dS showed a relatively weak, albeit significant, negative correlation 
with GC content and GS (Fig. 5c and d), whereas there was 
no significant correlation between F_opt and either of these genomic 
characteristics (Fig. 5e and f). The peculiar, U-shaped dependence 
of F_opt on GC content most likely reflects the paucity of codon 
choices in extremely AT-rich and extremely GC-rich genomes, 
resulting in an inflation of F_opt values that does not reflect selective 
processes. Thus, the strong dependence of R on GC content appears 
to be a distinct phenomenon, with the implication that coupling 
between selection on protein sequence and selection on 
codon usage is a selectable trait in itself. 

To further explore the potential biological underpinning of the 
strong connection between R and genomic GC content, we used 
the parameter ΔGC, which was defined as the difference in GC 
content between optimal and nonoptimal codons as follows: 

\[
\Delta GC = \sum_{i=1}^{18} \left( f^{i}_{opt} \times GC^{i}_{opt} - f^{i}_{nonopt} \times GC^{i}_{nonopt} \right)
\]

Here the sum is taken over 18 amino acids with more than one 
codon for all of the orthologous gene pairs in a given ATGC; 
f^i_opt and f^i_nonopt are the frequencies of the optimal and nonoptimal 
codons of amino acid i, respectively; GC^i_opt and GC^i_nonopt are, 
respectively, the GC contents of the optimal and nonoptimal codons 
of amino acid i; f^i_nonopt × GC^i_nonopt is the mean of all nonoptimal 
codons for amino acid i. This parameter was designed to reflect the 
strength of selection for increased GC content in the optimal 
codon that could underlie the strong correlation between R and 
GC content. When the ΔGC values were plotted against the GC 
content for the 120 ATGC, a peculiar, nonmonotonic dependence 
was observed (Fig. 6). Whereas for low-GC genomes, ΔGC slightly 
decreased with the GC content, upward of ~45% GC, a steady 
increase in ΔGC was observed (Fig. 6). The small effect at a low GC 
content is likely to be purely statistical, caused by the strong bias 
toward AT. In contrast, at a high GC content, there seems to be 
strong selection for increased GC content of the optimal codons. 
Thus, the selection on codon bias indeed appears to be particularly 
pronounced in bacteria and archaea with GC-rich genomes. 
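
The ΔGC parameter can be sketched in code as follows. This is an illustration under stated assumptions rather than the implementation used here. The function names and the toy codon counts are hypothetical, GC content of a codon is taken as the fraction of G and C among its three bases, and the nonoptimal term is taken as the mean of f × GC over the nonoptimal codons of an amino acid, which is one possible reading of the definition above.

```python
def gc_fraction(codon):
    """Fraction of G or C bases in a codon (0, 1/3, 2/3, or 1)."""
    return sum(base in "GC" for base in codon) / 3


def delta_gc(codon_counts, optimal):
    """Sum over amino acids of (optimal term - mean nonoptimal term).

    codon_counts -- {amino_acid: {codon: count}}, pooled over the gene pairs
    optimal      -- {amino_acid: optimal codon}, assumed to be known
    """
    total = 0.0
    for amino_acid, counts in codon_counts.items():
        n = sum(counts.values())
        if n == 0 or len(counts) < 2:
            continue  # no synonymous choice, so nothing to compare
        opt = optimal[amino_acid]
        opt_term = counts[opt] / n * gc_fraction(opt)
        nonopt_terms = [count / n * gc_fraction(codon)
                        for codon, count in counts.items() if codon != opt]
        total += opt_term - sum(nonopt_terms) / len(nonopt_terms)
    return total


# Toy counts for two amino acids (hypothetical, not data from this study).
toy_counts = {
    "K": {"AAA": 25, "AAG": 75},
    "F": {"TTT": 50, "TTC": 50},
}
toy_optimal = {"K": "AAG", "F": "TTC"}
print(delta_gc(toy_counts, toy_optimal))
```

In this toy case both optimal codons end in G or C and both alternatives contain no G or C, so every amino acid contributes a positive term, which is the pattern described for GC-rich genomes.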

Dependence of coupling between the two levels of selection 
on lifestyle, biological function, and taxonomy of prokaryotes. 
We further investigated possible connections of the coupling between 
the selection on amino acid sequences and on codon usage 
with various biological features of prokaryotes, including optimal 
growth temperature, cell shape, sporulation capacity, motility, 
and oxygen requirement. None of these biological properties 
showed a significant link with R (data not shown). It appeared 
particularly plausible that the coupling between the two levels of 
selection would be linked to the optimal growth rate (time between 
cell divisions under optimal growth conditions) of a microbe 
(55). Again, however, no connection between this parameter 
and R was found to exist (Fig. 7). 

In contrast, the partitioning of Proteobacteria (the most extensively 
sequenced bacterial phylum) into pathogens and nonpathogens 
revealed a significantly stronger coupling among nonpathogens 
(Fig. 8). No such connection of dN/dS or F_opt values was 
detected. However, a significant difference between pathogenic 
and nonpathogenic bacteria was also observed with respect to GC 
content, with a higher GC content in nonpathogens (Fig. 8). These 
observations are compatible with the conclusion that (i) coupling 
of the selective processes at the protein and codon levels and (ii) 
GC content are subject to the same or related selective pressures. 

We further explored the coupling in different functional 
classes of genes by using the coarse-grain classification implemented 
in the Clusters of Orthologous Groups (COG) system (56, 
57). Although the differences between functional classes of genes 
were small in magnitude, genes that encode proteins related to 
metabolic activities (enzymes and transporters) consistently 
showed stronger coupling than informational genes encoding 
components of the translation, transcription, and replication sys- 
tems (Fig. 9a). The difference was found to be statistically signifi- 
cant when the metabolic genes were pooled and collectively com- 
pared to informational genes (Fig. 9b). 



Finally, we compared the strengths of the coupling between 
different bacterial and archaeal phyla (Fig. 10). Significant differ- 
ences were detected, with Actinobacteria showing particularly 
strong coupling, in contrast to the weak coupling in Cyanobacteria 
and Firmicutes. Among the two most extensively sequenced phyla, 
Proteobacteria showed significantly stronger coupling than Firmi- 
cutes. 

DISCUSSION 

The results of the present analysis demonstrate the coupling be- 
tween selection forces that affect protein sequences and codon 
usage. This relationship could be readily anticipated from previ- 
ous observations on the relationships between gene expression 
level and protein sequence conservation on the one hand and CUB 
on the other (30). The coupling between selection on protein sequence 
and selection on codon usage can be interpreted as a fine-tuning 
of translation via CUB that depends on the "status" of a 
gene in an organism. In "high-status" genes, which are highly expressed, 
tend to occupy central positions in various biological networks, 
and typically evolve slowly (65), the selection for translational 
fine-tuning apparently is measurably stronger than it is in 
lower-status genes, resulting in the observed negative correlations 
between dN/dS and F_opt. 

FIG 7 Dependence of R on optimal growth rates of bacteria and archaea. The 
optimal growth rate data (time between divisions under optimal growth conditions) 
are from reference 55. 



The main, nontrivial observation in this work is that the 
strength of the coupling between the two levels of selection is 
effectively determined by the genomic GC content. It has been 
shown that mutational processes in all organisms are biased toward 
AT accumulation, so by inference, high GC content results 
from selection (58). The nature of this selection is not fully under- 
stood, but apparently, CUB is an important optimization crite- 
rion, as demonstrated by the finding that in bacteria, CUB tracks 
the nucleotide composition of the intergenic regions and in par- 
ticular, that in sufficiently GC-rich genomes, the optimal codons 
typically contain G or C in synonymous positions. Moreover, the 
bias toward GC-rich codons is the strongest in highly expressed 



genes, such as those encoding translation system components 
(59). A subsequent, updated analysis indicates that the GC content 
in synonymous positions of codons tends to be higher than that in 
intergenic regions and that GC enrichment in synonymous posi- 
tions without changing protein sequences results in increased fit- 
ness of bacteria expressing the respective genes (60). The results of 
the present study add an extra dimension to these observations by 
showing that the dependence of the translational fine-tuning on 
gene status is strongly correlated with the genomic GC content. In 
other words, in GC-rich genomes, the difference between the lev- 
els of translational fine-tuning in high- and low-status genes is 

FIG 8 Comparison of the R values, dN/dS ratios, F_opt values, GSs, and GC contents of pathogens and nonpathogens in the phylum Proteobacteria. Of the 61 
proteobacteria, 38 were classified as pathogens and 16 were classified as nonpathogens (data are from the GOLD [http://www.genomesonline.org/] and PATRIC 
[http://patric.vbi.vt.edu/] databases; the remaining 7 species were not classified in either of the two categories; see Table S1 in the supplemental material). The 
R values and GC contents of pathogens differ significantly from those of nonpathogens. 



March/April 2014 Volume 5 Issue 2 e00956-14 



mBio mbio.asm.org 7 



Ran et al. 




FIG 9 Correlation of the F_opt values and dN/dS ratios (R values) of different functional classes of genes of prokaryotes. The data for all 120 ATGC were pooled. 
(a) Comparison of R values of two broad categories of genes, those encoding proteins involved in information processing and cellular functions (e.g., cell 
division) (cell&infor) and those encoding proteins involved in metabolism (enzymes and transporters). The mean R value of the metabolic class (red box) is 
−0.242, which is significantly greater in absolute value than the mean R value of −0.223 of the informational class (P < 0.02). (b) Comparison of R values of individual functional 
categories of genes. The functional categories are from the COG classifications. 



greater than it is in AT-rich genomes, resulting in the observed 
strong correlation between strength of coupling and GC content. 

This conclusion is clearly supported by the dependence of ΔGC 
on the genomic GC content (Fig. 6). 

Although historically it is customary to speak of GC content 
determining other features of genomes, the causality could be re- 
versed, with the selection for G and C at synonymous sites, which 
is particularly strong in high-status genes, driving the evolution 
toward high GC content (60). Furthermore, the overall optimiza- 
tion of the translational landscape of a microbial genome might 
enable the accumulation of genes via horizontal gene transfer and 
duplication, resulting in the strong positive correlation between 
GC content and GS (54) (Fig. 3). 

In addition to demonstrating the coupling between protein 



level selection and CUB and its dependence on GC content, we 
observed that the strength of this coupling differs for parasites 
versus nonparasites and for genes encoding metabolic proteins 
versus those encoding informational proteins. Although subtle, 
these differences were found to be significant and did not appear 
to be by-products of the GC content connection. One might hy- 
pothesize that translational fine-tuning shows a stronger depen- 
dence on gene status in organisms and genes that are involved in 
frequent adaptation to changing environments and that this fine- 
tuning is particularly important in genes directly involved in such 
adaptation. Furthermore, some of the prokaryotic phyla signifi- 
cantly differ in the strength of coupling, which is suggestive of 
additional links with lifestyle and physiology. 

The overall outcome of this analysis identifies the coupling 

FIG 10 Comparison of R values of major prokaryotic taxa. The taxa shown 
belong to Actinobacteria (ACT), Archaea (ARC), the Bacteroidetes-Chlorobi 
group (BCg), Cyanobacteria (CYA), Firmicutes (FIR), other phyla (OTH), Pro- 
teobacteria (PRO), Spirochaetes (SPI), Tenericutes (TEN), and all phyla (ALL). 
One-way analysis of variance was used to test if any two groups differ signifi- 
cantly from each other. At the 0.05 level, the correlation in Actinobacteria is 
significantly stronger than those in the Bacteroidetes-Chlorobi group, Cyano- 
bacteria, Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, and other phyla, 
and the correlation in Archaea and Proteobacteria is significantly stronger than 
those in Cyanobacteria, Firmicutes, and Tenericutes. 



between selection processes that act at the level of proteins and at 
the level of codon usage as a distinct characteristic of prokaryotic 
genome evolution. The strength of this coupling is tightly linked 
to genomic GC content and could be an important determinant of 
the nucleotide composition of genomes, the evolution of which 
remains poorly understood. 

The biological factors behind the wide range of the strengths of 
coupling between the two levels of selection, from very strong 
negative correlation in many groups of microbes to a positive 
correlation in a few groups (Fig. 1 and 2), remain unclear. Ex- 
plaining the nature of this variance and connecting it to specifics 
of microbial biology is a challenge for further research. Given the 
stronger coupling observed for operational genes than for infor- 
mational genes (Fig. 9), it appears plausible that fine-tuning of 
CUB is subject to stronger selection in microbes whose lifestyle 
includes adaptation to changing environments that requires rapid 
protein dosage adjustment via translational regulation. 

MATERIALS AND METHODS 

The ATGC database and genome sequences. The ATGC database was 
built in 2009 and included 446 prokaryotic genomes and 104 ATGC (14). 
We updated the data set to include 1,390 genomes and 120 ATGC; Firmi- 
cutes and Proteobacteria account for 63.3% of the genomes (880/1,390) 
because of the relative paucity of sequenced genomes from other phyla. 
All of the pairs of orthologous genes in this database are synteny- 
supported bidirectional best hits (12, 14, 61). Altogether, 2,817,540 or- 
thologous gene pairs were analyzed. 

Selection of genome pairs for analysis. Suppose there are m species in 
an ATGC with the same gene number, n, which is the simplest case. The 
number of orthologous gene pairs is then n × {m!/[2!(m − 2)!]}. Obviously, 
the number of orthologous gene pairs rapidly increases with the 
number of species in an ATGC. If all of the orthologous gene pairs from all 
ATGC were taken into account in this analysis, the results would have 
been strongly biased toward large ATGC. Thus, we randomly chose a pair 
of species from each ATGC containing more than two species and used 



the orthologous gene pairs from these two species as a representative 
sample of the given ATGC. 
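
The growth in the number of gene pairs, and the sampling of one genome pair per ATGC, can be sketched as follows. The function names and the genome labels are hypothetical, the seed is fixed only to make the example reproducible, and the gene count of 3,000 is an arbitrary illustrative value.

```python
import math
import random


def n_ortholog_pairs(n_genes, m_genomes):
    """n x C(m, 2): orthologous gene pairs when all m genomes share the same n genes."""
    return n_genes * math.comb(m_genomes, 2)


def pick_genome_pair(genomes, rng):
    """Return the only pair of a two-genome cluster, otherwise a random pair."""
    if len(genomes) < 2:
        raise ValueError("an ATGC must contain at least two genomes")
    if len(genomes) == 2:
        return tuple(genomes)
    return tuple(rng.sample(genomes, 2))


# With 3,000 shared genes, 2 genomes give 3,000 pairs but 20 genomes give 570,000.
print(n_ortholog_pairs(3000, 2))
print(n_ortholog_pairs(3000, 20))

rng = random.Random(0)  # fixed seed, for reproducibility of the example only
print(pick_genome_pair(["genome_A", "genome_B", "genome_C", "genome_D"], rng))
```

Sampling a single pair per ATGC makes each cluster contribute roughly n gene pairs regardless of how many genomes it contains.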

Calculation of parameters. For each orthologous gene pair in an 
ATGC, protein sequences were aligned with MUSCLE (62), and the protein 
alignment was used to generate the alignment of the respective nucleotide 
sequences extracted from the genomic sequences by using a custom 
script. Maximum-likelihood approximation (codeML) was used to 
calculate dS and dN (63). In order to eliminate those orthologous gene 
pairs for which the estimates of the parameters were deemed unreliable 
either because of the small number of substitutions or conversely because 
of extreme divergence, the gene pairs with a dN value of <0.0002, a dS 
value of <0.0002, a dS value of >3, or a dN/dS ratio of >3 were discarded. 
The orthologous gene pairs in which the lengths of the two genes differed 
by more than 20% (presumably because of gene misannotation) were 
discarded as well. F_opt is a widely used measure of CUB (64). The F_opt 
values of genes in an orthologous gene pair are very close in most cases 
(data not shown). Thus, the mean F_opt value of two orthologous genes was 
taken as the F_opt value for that gene pair. 
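
The filtering criteria described above can be written as a single predicate. This is a minimal sketch: the function name is hypothetical, and measuring the 20% length difference relative to the longer of the two genes is an assumption, because the reference length is not specified in the text.

```python
def keep_gene_pair(dn, ds, len_a, len_b):
    """Apply the reliability filters to one orthologous gene pair.

    dn, ds       -- nonsynonymous and synonymous substitution rate estimates
    len_a, len_b -- lengths of the two genes
    """
    # Too few substitutions for a reliable estimate.
    if dn < 0.0002 or ds < 0.0002:
        return False
    # Extreme divergence (ds is at least 0.0002 here, so the division is safe).
    if ds > 3 or dn / ds > 3:
        return False
    # Length mismatch above 20%, taken relative to the longer gene (assumption).
    if abs(len_a - len_b) > 0.2 * max(len_a, len_b):
        return False
    return True


print(keep_gene_pair(0.01, 0.10, 900, 930))    # passes all filters
print(keep_gene_pair(0.0001, 0.10, 900, 900))  # dN too small
print(keep_gene_pair(0.01, 0.10, 600, 900))    # lengths differ by more than 20%
```

Pairs that pass the filter would then be assigned the mean of the two gene-level F_opt values, as described above.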

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at http://mbio.asm.org 
/lookup/suppl/doi:10.1128/mBio.00956-14/-/DCSupplemental. 
Table S1, PDF file, 0.1 MB. 